Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service means lost revenue, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not give up their credit cards.
Objective
- Explore and visualize the dataset.
- Build a classification model to predict whether a customer is going to churn.
- Optimize the model using appropriate techniques.
- Generate a set of insights and recommendations that will help the bank.
Data Dictionary:
Importing Libraries
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries to tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
#libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from xgboost import XGBClassifier
#creating DataFrame
df = pd.read_csv("BankChurners.csv")
df.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           10127 non-null  object
 6   Marital_Status            10127 non-null  object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
#checking unique values in the data set.
df.nunique()
CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 7
Marital_Status                  4
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64
# Dropping columns - CLIENTNUM
df.drop(columns=["CLIENTNUM"], inplace=True)
df.head()
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Getting a statistical summary of the numeric columns (count, mean, standard deviation, and the five-number summary).
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
# Making a list of all categorical variables
cat_col = [
"Attrition_Flag",
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category"
]
# Printing the count of each unique value in each column
for column in cat_col:
    print(df[column].value_counts())
    print("-" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
----------------------------------------
F    5358
M    4769
Name: Gender, dtype: int64
----------------------------------------
Graduate         3128
High School      2013
Unknown          1519
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
----------------------------------------
Married     4687
Single      3943
Unknown      749
Divorced     748
Name: Marital_Status, dtype: int64
----------------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
Unknown           1112
$120K +            727
Name: Income_Category, dtype: int64
----------------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
----------------------------------------
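The counts above show an imbalanced target: 1,627 of 10,127 customers have attrited. A minimal sketch of that arithmetic, using the counts printed above rather than re-reading the file (the variable names are illustrative, not from the notebook):

```python
import pandas as pd

# Counts copied from the Attrition_Flag output above.
counts = pd.Series({"Existing Customer": 8500, "Attrited Customer": 1627})

# Share of each class, in percent.
share = (100 * counts / counts.sum()).round(2)
print(share)

# Accuracy of a naive baseline that always predicts "Existing Customer".
baseline_accuracy = counts.max() / counts.sum()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Because a model that never predicts attrition already scores this well on accuracy, metrics such as recall on the attrited class are likely to be more informative for this problem.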
# replacing College with Graduate
df["Education_Level"] = df["Education_Level"].replace("College", "Graduate")
df[df["Education_Level"] == "Unknown"]
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Existing Customer | 51 | M | 4 | Unknown | Married | $120K + | Gold | 46 | 6 | 1 | 3 | 34516.0 | 2264 | 32252.0 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 11 | Existing Customer | 65 | M | 1 | Unknown | Married | $40K - $60K | Blue | 54 | 6 | 2 | 3 | 9095.0 | 1587 | 7508.0 | 1.433 | 1314 | 26 | 1.364 | 0.174 |
| 15 | Existing Customer | 44 | M | 4 | Unknown | Unknown | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.0 | 972 | 3262.0 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 17 | Existing Customer | 41 | M | 3 | Unknown | Married | $80K - $120K | Blue | 34 | 4 | 4 | 1 | 13535.0 | 1291 | 12244.0 | 0.653 | 1028 | 21 | 1.625 | 0.095 |
| 23 | Existing Customer | 47 | F | 4 | Unknown | Single | Less than $40K | Blue | 36 | 3 | 3 | 2 | 2492.0 | 1560 | 932.0 | 0.573 | 1126 | 23 | 0.353 | 0.626 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10090 | Existing Customer | 36 | F | 3 | Unknown | Married | $40K - $60K | Blue | 22 | 5 | 3 | 3 | 12958.0 | 2273 | 10685.0 | 0.608 | 15681 | 96 | 0.627 | 0.175 |
| 10094 | Existing Customer | 59 | M | 1 | Unknown | Single | $60K - $80K | Blue | 48 | 3 | 1 | 2 | 7288.0 | 0 | 7288.0 | 0.640 | 14873 | 120 | 0.714 | 0.000 |
| 10095 | Existing Customer | 46 | M | 3 | Unknown | Married | $80K - $120K | Blue | 33 | 4 | 1 | 3 | 34516.0 | 1099 | 33417.0 | 0.816 | 15490 | 110 | 0.618 | 0.032 |
| 10118 | Attrited Customer | 50 | M | 1 | Unknown | Unknown | $80K - $120K | Blue | 36 | 6 | 3 | 4 | 9959.0 | 952 | 9007.0 | 0.825 | 10310 | 63 | 1.100 | 0.096 |
| 10123 | Attrited Customer | 41 | M | 2 | Unknown | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
1519 rows × 20 columns
df[df['Marital_Status']=='Unknown']
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 7 | Existing Customer | 32 | M | 0 | High School | Unknown | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.0 | 1396 | 27685.0 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 10 | Existing Customer | 42 | M | 5 | Uneducated | Unknown | $120K + | Blue | 31 | 5 | 3 | 2 | 6748.0 | 1467 | 5281.0 | 0.831 | 1201 | 42 | 0.680 | 0.217 |
| 13 | Existing Customer | 35 | M | 3 | Graduate | Unknown | $60K - $80K | Blue | 30 | 5 | 1 | 3 | 8547.0 | 1666 | 6881.0 | 1.163 | 1311 | 33 | 2.000 | 0.195 |
| 15 | Existing Customer | 44 | M | 4 | Unknown | Unknown | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.0 | 972 | 3262.0 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10070 | Existing Customer | 47 | M | 3 | High School | Unknown | $80K - $120K | Silver | 40 | 5 | 3 | 2 | 34516.0 | 1371 | 33145.0 | 0.691 | 15930 | 123 | 0.836 | 0.040 |
| 10100 | Existing Customer | 39 | M | 2 | Graduate | Unknown | $60K - $80K | Silver | 36 | 4 | 2 | 2 | 29808.0 | 0 | 29808.0 | 0.669 | 16098 | 128 | 0.684 | 0.000 |
| 10101 | Existing Customer | 42 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 30 | 3 | 2 | 5 | 3735.0 | 1723 | 2012.0 | 0.595 | 14501 | 92 | 0.840 | 0.461 |
| 10118 | Attrited Customer | 50 | M | 1 | Unknown | Unknown | $80K - $120K | Blue | 36 | 6 | 3 | 4 | 9959.0 | 952 | 9007.0 | 0.825 | 10310 | 63 | 1.100 | 0.096 |
| 10125 | Attrited Customer | 30 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
749 rows × 20 columns
df[df['Dependent_count']==0]
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | Existing Customer | 32 | M | 0 | High School | Unknown | $60K - $80K | Silver | 27 | 2 | 2 | 2 | 29081.0 | 1396 | 27685.0 | 2.204 | 1538 | 36 | 0.714 | 0.048 |
| 21 | Attrited Customer | 62 | F | 0 | Graduate | Married | Less than $40K | Blue | 49 | 2 | 3 | 3 | 1438.3 | 0 | 1438.3 | 1.047 | 692 | 16 | 0.600 | 0.000 |
| 34 | Existing Customer | 58 | M | 0 | Graduate | Married | $80K - $120K | Blue | 49 | 6 | 2 | 2 | 12555.0 | 1696 | 10859.0 | 0.519 | 1291 | 24 | 0.714 | 0.135 |
| 39 | Attrited Customer | 66 | F | 0 | Doctorate | Married | Unknown | Blue | 56 | 5 | 4 | 3 | 7882.0 | 605 | 7277.0 | 1.052 | 704 | 16 | 0.143 | 0.077 |
| 52 | Existing Customer | 66 | F | 0 | High School | Married | Less than $40K | Blue | 54 | 3 | 4 | 2 | 3171.0 | 2179 | 992.0 | 1.224 | 1946 | 38 | 1.923 | 0.687 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10097 | Existing Customer | 31 | M | 0 | High School | Single | $40K - $60K | Blue | 25 | 3 | 2 | 3 | 4493.0 | 1388 | 3105.0 | 0.795 | 17744 | 104 | 0.763 | 0.309 |
| 10106 | Existing Customer | 58 | F | 0 | Graduate | Single | Less than $40K | Blue | 48 | 2 | 2 | 5 | 4299.0 | 1334 | 2965.0 | 0.660 | 15068 | 123 | 0.685 | 0.310 |
| 10107 | Attrited Customer | 61 | M | 0 | Graduate | Single | $60K - $80K | Blue | 54 | 2 | 1 | 4 | 11859.0 | 1644 | 10215.0 | 0.866 | 8930 | 79 | 0.837 | 0.139 |
| 10113 | Attrited Customer | 27 | M | 0 | High School | Divorced | $60K - $80K | Blue | 36 | 2 | 3 | 2 | 13303.0 | 2517 | 10786.0 | 0.929 | 10219 | 85 | 0.809 | 0.189 |
| 10114 | Existing Customer | 29 | M | 0 | Graduate | Married | Less than $40K | Blue | 15 | 3 | 1 | 5 | 4700.0 | 0 | 4700.0 | 0.617 | 14723 | 96 | 0.655 | 0.000 |
904 rows × 20 columns
groupdata = df.groupby(by=['Dependent_count'])['Marital_Status']
groupdata.describe()
| Dependent_count | count | unique | top | freq |
|---|---|---|---|---|
| 0 | 904 | 4 | Single | 399 |
| 1 | 1838 | 4 | Married | 832 |
| 2 | 2655 | 4 | Married | 1284 |
| 3 | 2732 | 4 | Married | 1251 |
| 4 | 1574 | 4 | Married | 727 |
| 5 | 424 | 4 | Married | 206 |
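# The summary above shows that Single is the most common marital status when the dependent count is 0.
# Updating Marital_Status accordingly. Note that this assignment sets the status to "Single" for
# every customer with no dependents, including those whose status was already known.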
df.loc[df.Dependent_count==0, 'Marital_Status'] = "Single"
df.head()
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
df[df['Marital_Status']=='Unknown']
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 10 | Existing Customer | 42 | M | 5 | Uneducated | Unknown | $120K + | Blue | 31 | 5 | 3 | 2 | 6748.0 | 1467 | 5281.0 | 0.831 | 1201 | 42 | 0.680 | 0.217 |
| 13 | Existing Customer | 35 | M | 3 | Graduate | Unknown | $60K - $80K | Blue | 30 | 5 | 1 | 3 | 8547.0 | 1666 | 6881.0 | 1.163 | 1311 | 33 | 2.000 | 0.195 |
| 15 | Existing Customer | 44 | M | 4 | Unknown | Unknown | $80K - $120K | Blue | 37 | 5 | 1 | 2 | 4234.0 | 972 | 3262.0 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 26 | Existing Customer | 59 | M | 1 | High School | Unknown | $40K - $60K | Blue | 46 | 4 | 1 | 2 | 14784.0 | 1374 | 13410.0 | 0.921 | 1197 | 23 | 1.300 | 0.093 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10070 | Existing Customer | 47 | M | 3 | High School | Unknown | $80K - $120K | Silver | 40 | 5 | 3 | 2 | 34516.0 | 1371 | 33145.0 | 0.691 | 15930 | 123 | 0.836 | 0.040 |
| 10100 | Existing Customer | 39 | M | 2 | Graduate | Unknown | $60K - $80K | Silver | 36 | 4 | 2 | 2 | 29808.0 | 0 | 29808.0 | 0.669 | 16098 | 128 | 0.684 | 0.000 |
| 10101 | Existing Customer | 42 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 30 | 3 | 2 | 5 | 3735.0 | 1723 | 2012.0 | 0.595 | 14501 | 92 | 0.840 | 0.461 |
| 10118 | Attrited Customer | 50 | M | 1 | Unknown | Unknown | $80K - $120K | Blue | 36 | 6 | 3 | 4 | 9959.0 | 952 | 9007.0 | 0.825 | 10310 | 63 | 1.100 | 0.096 |
| 10125 | Attrited Customer | 30 | M | 2 | Graduate | Unknown | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
706 rows × 20 columns
# Replacing the remaining Unknown values with Married, as it is the most common status for the remaining records.
df["Marital_Status"] = df["Marital_Status"].replace("Unknown", "Married")
# Checking whether any Unknown records remain.
df[df['Marital_Status']=='Unknown']
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|
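The `.loc` assignment used earlier sets `Marital_Status` to "Single" for every customer with zero dependents, whether or not the status was originally unknown, and the `replace` call then fills whatever is left with "Married". If the intent were to fill only the unknown entries, masking on both conditions is one option. The following is a sketch on a toy frame (`toy` is a hypothetical name), not the notebook's method, and applying it to the real data would change the category counts reported later:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Dependent_count": [0, 0, 2, 3],
        "Marital_Status": ["Married", "Unknown", "Unknown", "Divorced"],
    }
)

# Compute the mask once, before any value is changed.
unknown = toy["Marital_Status"] == "Unknown"

# Fill only the unknown rows: "Single" when there are no dependents,
# "Married" otherwise. Known statuses are left untouched.
toy.loc[unknown & (toy["Dependent_count"] == 0), "Marital_Status"] = "Single"
toy.loc[unknown & (toy["Dependent_count"] != 0), "Marital_Status"] = "Married"

print(toy["Marital_Status"].tolist())
```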
# We will replace the Unknown values in Income_Category with NaN and later use the KNN imputer to fill them.
df["Income_Category"] = df["Income_Category"].replace("Unknown", np.nan)
df.head()
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Married | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
df.isna().sum()
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level                0
Marital_Status                 0
Income_Category             1112
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
# In univariate analysis of numerical variables we want to study their central tendency
# and dispersion.
# The helper below draws a boxplot and a histogram for any numerical variable,
# which keeps the plotting code that follows short and consistent.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined.

    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker indicates the mean value of the column
    # Histogram. sns.distplot is deprecated in recent seaborn releases, so histplot is used;
    # bins="auto" is histplot's default binning when no bin count is given.
    sns.histplot(feature, kde=False, ax=ax_hist2, bins=bins if bins else "auto")
    ax_hist2.axvline(
        np.mean(feature), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        np.median(feature), color="black", linestyle="-"
    )  # Add median to the histogram
# Observations on Customer_Age
histogram_boxplot(df["Customer_Age"])
# Observations on Dependent_count
histogram_boxplot(df["Dependent_count"])
# Observations on Months_on_book
histogram_boxplot(df["Months_on_book"])
# Observations on Total_Relationship_Count
histogram_boxplot(df["Total_Relationship_Count"])
# Observations on Months_Inactive_12_mon
histogram_boxplot(df["Months_Inactive_12_mon"])
# Observations on Contacts_Count_12_mon
histogram_boxplot(df["Contacts_Count_12_mon"])
# Observations on Credit_Limit
histogram_boxplot(df["Credit_Limit"])
# Observations on Total_Revolving_Bal
histogram_boxplot(df["Total_Revolving_Bal"])
# Observations on Avg_Open_To_Buy
histogram_boxplot(df["Avg_Open_To_Buy"])
# Observations on Total_Amt_Chng_Q4_Q1
histogram_boxplot(df["Total_Amt_Chng_Q4_Q1"])
# Capping values for Total_Amt_Chng_Q4_Q1 at 2.5
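# Assign the result back: inplace=True on a selected column may not update df in newer pandas versions.
df["Total_Amt_Chng_Q4_Q1"] = df["Total_Amt_Chng_Q4_Q1"].clip(upper=2.5)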
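Capping with `clip` leaves values at or below the threshold unchanged and replaces anything above it with the threshold itself. The source does not say how 2.5 was chosen. The sketch below shows the behaviour on a handful of values from the tables above and, purely for comparison, computes the conventional upper boxplot whisker (Q3 + 1.5 × IQR) from the quartiles in the summary table:

```python
import pandas as pd

# A few Total_Amt_Chng_Q4_Q1 values, including the column maximum (3.397).
values = pd.Series([0.631, 0.736, 0.859, 2.594, 3.397])

# clip returns a new Series; values above 2.5 become exactly 2.5.
capped = values.clip(upper=2.5)
print(capped.tolist())

# Upper whisker from the 25% and 75% quartiles reported by df.describe().
q1, q3 = 0.631, 0.859
upper_whisker = q3 + 1.5 * (q3 - q1)
print(round(upper_whisker, 3))
```

The 2.5 cap is far looser than the whisker rule, so it only trims the most extreme values and leaves most of the right tail in place.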
# Observations on Total_Trans_Amt
histogram_boxplot(df["Total_Trans_Amt"])
# Observations on Total_Trans_Ct
histogram_boxplot(df["Total_Trans_Ct"])
# Observations on Total_Ct_Chng_Q4_Q1
histogram_boxplot(df["Total_Ct_Chng_Q4_Q1"])
# Observations on Avg_Utilization_Ratio
histogram_boxplot(df["Avg_Utilization_Ratio"])
def perc_on_bar(feature):
    """
    Plot a countplot of a categorical feature, annotating each bar with its percentage.
    feature: categorical feature
    The function won't work if a column is passed in the hue parameter.
    """
    # Creating a countplot for the feature
    sns.set(rc={"figure.figsize": (10, 5)})
    ax = sns.countplot(x=feature, data=df)
    total = len(feature)  # length of the column
    for p in ax.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.1  # x position of the label
        y = p.get_y() + p.get_height()  # y position of the label (top of the bar)
        ax.annotate(percentage, (x, y), size=14)  # annotate the percentage
    plt.show()  # show the plot
# observations on Gender
perc_on_bar(df["Gender"])
# observations on Dependent_count
perc_on_bar(df["Dependent_count"])
# observations on Education_Level
perc_on_bar(df["Education_Level"])
# observations on Marital_Status
perc_on_bar(df["Marital_Status"])
# observations on Income_Category
perc_on_bar(df["Income_Category"])
# observations on Card_Category
perc_on_bar(df["Card_Category"])
# observations on Attrition_Flag
perc_on_bar(df["Attrition_Flag"])
sns.pairplot(df, hue="Attrition_Flag")
cols = df[
[
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
]
].columns.tolist()
plt.figure(figsize=(15, 20))
for i, variable in enumerate(cols):
    plt.subplot(6, 2, i + 1)
    # Keyword arguments are required by recent seaborn versions.
    sns.boxplot(x=df["Attrition_Flag"], y=df[variable])
    plt.tight_layout()
    plt.title(variable)
plt.show()
sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
df.select_dtypes(include="number").corr(),  # correlations are defined for numeric columns only
annot=True,
linewidths=0.5,
center=0,
cbar=False,
cmap="YlGnBu",
fmt="0.2f",
)
plt.show()
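A heatmap is easy to scan, but the same information can also be listed programmatically. The sketch below builds a small synthetic frame (the column names mirror the dataset, the numbers are randomly generated) and extracts the feature pairs whose absolute correlation exceeds a threshold. The 0.8 cutoff is an assumption for illustration, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
limit = rng.uniform(1500, 35000, size=200)
revolving = rng.uniform(0, 2500, size=200)

toy = pd.DataFrame(
    {
        "Credit_Limit": limit,
        "Total_Revolving_Bal": revolving,
        # In the rows shown earlier, open-to-buy equals limit minus revolving balance.
        "Avg_Open_To_Buy": limit - revolving,
    }
)

corr = toy.corr().abs()

# Keep the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.8]
print(high)
```

Pairs flagged this way are candidates for dropping one member, since a near-duplicate feature adds little information.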
# Dropping columns
df.drop(
columns=[
"Customer_Age",
"Avg_Open_To_Buy",
"Total_Trans_Ct",
"Avg_Utilization_Ratio",
],
inplace=True,
)
# Making a list of all categorical variables
cat_col = [
"Attrition_Flag",
"Gender",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category"
]
# Printing the count of each unique value in each column
for column in cat_col:
    print(df[column].value_counts())
    print("-" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
----------------------------------------
F    5358
M    4769
Name: Gender, dtype: int64
----------------------------------------
Graduate         4141
High School      2013
Unknown          1519
Uneducated       1487
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
----------------------------------------
Married     5006
Single      4448
Divorced     673
Name: Marital_Status, dtype: int64
----------------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
$120K +            727
Name: Income_Category, dtype: int64
----------------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
----------------------------------------
df.head()
| | Attrition_Flag | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Ct_Chng_Q4_Q1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 1.335 | 1144 | 1.625 |
| 1 | Existing Customer | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 1.541 | 1291 | 3.714 |
| 2 | Existing Customer | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 2.500 | 1887 | 2.333 |
| 3 | Existing Customer | F | 4 | High School | Married | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 1.405 | 1171 | 2.333 |
| 4 | Existing Customer | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 2.175 | 816 | 2.500 |
Attrition_Flag = {'Existing Customer':0, 'Attrited Customer':1}
df['Attrition_Flag']=df['Attrition_Flag'].map(Attrition_Flag)
Gender = {'F':1,'M':2}
df['Gender']=df['Gender'].map(Gender)
Marital_Status = {'Married':1,'Single':2,'Divorced':3}
df['Marital_Status']=df['Marital_Status'].map(Marital_Status)
Income_Category = {'Less than $40K':1,'$40K - $60K':2,'$60K - $80K':3,'$80K - $120K':4,'$120K +':5}
df['Income_Category']=df['Income_Category'].map(Income_Category)
Card_Category = {'Blue':1,'Silver':2,'Gold':3,'Platinum':4}
df['Card_Category']=df['Card_Category'].map(Card_Category)
Education_Level = {'Graduate':1,'High School':2,'Uneducated':3,'Post-Graduate':4,'Doctorate':5,'Unknown':np.nan}
df['Education_Level']=df['Education_Level'].map(Education_Level)
The mapping above sets the Unknown education level to null. We will then use the KNN imputer to fill in the missing values.
df.isna().sum()
Attrition_Flag                 0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status                 0
Income_Category             1112
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Ct_Chng_Q4_Q1            0
dtype: int64
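`Series.map` with a dictionary returns NaN for any value that is mapped to NaN explicitly and also for any value that is missing from the dictionary, which is how "Unknown" becomes a missing value here. A small sketch on a toy Series (hypothetical data and a shortened mapping) that also shows a quick way to catch labels a mapping forgot:

```python
import numpy as np
import pandas as pd

education_codes = {"Graduate": 1, "High School": 2, "Unknown": np.nan}
levels = pd.Series(["Graduate", "Unknown", "High School", "Doctorate"])

# "Unknown" maps to NaN on purpose; "Doctorate" is absent from the dict,
# so it silently becomes NaN as well.
encoded = levels.map(education_codes)
print(encoded.tolist())

# List the labels that the mapping does not cover.
unmapped = sorted(set(levels) - set(education_codes))
print(unmapped)
```

Running a check like `unmapped` before imputation helps ensure that the only missing values are the intended ones.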
# Separating target variable and other variables
X = df.drop(columns="Attrition_Flag")
Y = df["Attrition_Flag"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(7088, 15) (3039, 15)
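Passing `stratify=Y` keeps the attrition rate roughly equal in the train and test partitions. A sketch on a synthetic target with the same class counts as the dataset (8,500 existing, 1,627 attrited); the single feature column is only a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts.
y = np.array([0] * 8500 + [1] * 1627)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_tr.shape, X_te.shape)
# Positive-class rate in each partition.
print(round(y_tr.mean(), 4), round(y_te.mean(), 4))
```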
imputer = KNNImputer(n_neighbors=5)
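# Fit and transform the train data (keeping the original index so X_train stays aligned with y_train)
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# Transform the test data with the imputer fitted on the train data
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)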
#Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Ct_Chng_Q4_Q1         0
dtype: int64
------------------------------
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Ct_Chng_Q4_Q1         0
dtype: int64
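`KNNImputer` fills each missing entry with the mean of that feature over the nearest neighbours, where distance is measured on the features that are present. Fitting on the training set only keeps test rows out of the neighbour pool. A toy sketch with made-up numbers (`n_neighbors=2` is chosen for readability, not taken from the notebook):

```python
import numpy as np
from sklearn.impute import KNNImputer

train = np.array(
    [
        [1.0, 10.0],
        [2.0, 20.0],
        [3.0, 30.0],
        [4.0, np.nan],
    ]
)
test = np.array([[1.5, np.nan]])

imputer = KNNImputer(n_neighbors=2)
train_filled = imputer.fit_transform(train)  # neighbours come from train only
test_filled = imputer.transform(test)  # test rows borrow values from train rows

print(train_filled[3])
print(test_filled[0])
```

The imputed values are averages, so on integer-coded categories they can be fractional, which is why the codes need rounding before they are mapped back to labels.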
## Function to inverse the encoding
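def inverse_mapping(x, y):
    """Map the numeric codes in column y of X_train and X_test back to labels using the encoding dict x."""
    inv_dict = {v: k for k, v in x.items()}
    # Imputed codes can be fractional, so round to the nearest code first
    X_train[y] = np.round(X_train[y]).map(inv_dict).astype("category")
    X_test[y] = np.round(X_test[y]).map(inv_dict).astype("category")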
inverse_mapping(Gender,'Gender')
inverse_mapping(Marital_Status,'Marital_Status')
inverse_mapping(Income_Category,'Income_Category')
inverse_mapping(Card_Category,'Card_Category')
inverse_mapping(Education_Level,'Education_Level')
cols = X_train.select_dtypes(include=['object','category'])
for i in cols.columns:
    print(X_train[i].value_counts())
    print('*'*30)
F 3770 M 3318 Name: Gender, dtype: int64 ****************************** Graduate 3074 High School 2139 Uneducated 1194 Post-Graduate 369 Doctorate 312 Name: Education_Level, dtype: int64 ****************************** Married 3508 Single 3121 Divorced 459 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 2638 $40K - $60K 1510 $80K - $120K 1225 $60K - $80K 1200 $120K + 515 Name: Income_Category, dtype: int64 ****************************** Blue 6621 Silver 375 Gold 78 Platinum 14 Name: Card_Category, dtype: int64 ******************************
cols = X_test.select_dtypes(include=['object','category'])
for i in cols.columns:
    print(X_test[i].value_counts())
    print('*'*30)
F 1588 M 1451 Name: Gender, dtype: int64 ****************************** Graduate 1339 High School 877 Uneducated 529 Post-Graduate 155 Doctorate 139 Name: Education_Level, dtype: int64 ****************************** Married 1498 Single 1327 Divorced 214 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 1124 $40K - $60K 670 $80K - $120K 507 $60K - $80K 507 $120K + 231 Name: Income_Category, dtype: int64 ****************************** Blue 2815 Silver 180 Gold 38 Platinum 6 Name: Card_Category, dtype: int64 ******************************
X_train=pd.get_dummies(X_train,drop_first=True)
X_test=pd.get_dummies(X_test,drop_first=True)
print(X_train.shape, X_test.shape)
(7088, 24) (3039, 24)
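Both splits end up with 24 columns here, so the separate get_dummies calls produced matching layouts. That is not guaranteed in general: if a rare level is missing from one split, the two frames can end up with different columns. A minimal sketch of one way to guard against this, using hypothetical toy frames (not the bank data) and aligning the test columns to the training columns:

```python
import pandas as pd

# Hypothetical toy frames: the test split happens to lack the "Platinum" level.
toy_train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold", "Platinum"]})
toy_test = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Gold"]})

train_enc = pd.get_dummies(toy_train, drop_first=True)
test_enc = pd.get_dummies(toy_test, drop_first=True)

# Force the test columns to match the training columns: levels absent from
# the test split become all-zero columns, and any extra columns are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```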
## Function to calculate different metric scores of the model - Accuracy, Recall, Precision and F1
def get_metrics_score(model, train, test, train_y, test_y, flag=True):
    '''
    model : fitted classifier used to predict on train and test
    train, test : feature sets
    train_y, test_y : true labels
    flag : if True, print the scores
    '''
    pred_train = model.predict(train)
    pred_test = model.predict(test)

    train_acc = model.score(train, train_y)
    test_acc = model.score(test, test_y)
    train_recall = metrics.recall_score(train_y, pred_train)
    test_recall = metrics.recall_score(test_y, pred_test)
    train_precision = metrics.precision_score(train_y, pred_train)
    test_precision = metrics.precision_score(test_y, pred_test)
    train_f1 = metrics.f1_score(train_y, pred_train)
    test_f1 = metrics.f1_score(test_y, pred_test)

    # list holding the train and test results
    score_list = [train_acc, test_acc, train_recall, test_recall,
                  train_precision, test_precision, train_f1, test_f1]

    # The scores are printed only when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 on training set : ", train_f1)
        print("F1 on test set : ", test_f1)

    return score_list  # returning the list with train and test scores
def make_confusion_matrix(model, y_actual, labels=(0, 1)):
    '''
    model : classifier to predict values of X_test
    y_actual : ground truth for X_test
    labels : class labels, in the order used for the rows and columns of the matrix
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=list(labels))
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    # Cell annotations: count on the first line, share of all test rows on the second
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(4, 3))
    sns.heatmap(df_cm, annot=annot, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
lr = LogisticRegression(random_state=1)
lr.fit(X_train,y_train)
LogisticRegression(random_state=1)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_bfr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.figure(figsize=(6, 6))
plt.boxplot(cv_result_bfr)
plt.show()
Model's recall score is in the range of 0.30 to 0.39
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(lr,y_test)
Accuracy on training set : 0.8797968397291196
Accuracy on test set : 0.8719973675551168
Recall on training set : 0.36611062335381916
Recall on test set : 0.3319672131147541
Precision on training set : 0.7623400365630713
Precision on test set : 0.72
F1 on training set : 0.49466192170818507
F1 on test set : 0.45441795231416543
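As a worked check of how these test-set figures relate to each other, the sketch below recomputes them from confusion-matrix counts. The four counts are an assumption: they are back-calculated from the reported test metrics and the 3039 test rows, not read from the notebook output, so treat them as a reconstruction.

```python
# Confusion-matrix counts back-calculated from the reported test metrics
# (an assumption for illustration; the notebook only shows them in the heatmap).
tp, fn, fp, tn = 162, 326, 63, 2488

accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of all rows predicted correctly
recall = tp / (tp + fn)                              # share of actual churners that were caught
precision = tp / (tp + fp)                           # share of predicted churners that really churn
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f} f1={f1:.4f}")
```

The high accuracy comes almost entirely from the majority class: under these counts the model misses 326 of the 488 churners in the test set, which is why recall is the metric tracked in the rest of the notebook.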
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un==1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 1139 Before Under Sampling, counts of label 'No': 5949 After Under Sampling, counts of label 'Yes': 1139 After Under Sampling, counts of label 'No': 1139 After Under Sampling, the shape of train_X: (2278, 24) After Under Sampling, the shape of train_y: (2278,)
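The idea behind random undersampling can be sketched in a few lines of plain Python. This is only an illustration of the technique under simple assumptions (binary 0/1 labels, balancing to a 1:1 ratio); it is not imbalanced-learn's implementation, and the helper name random_undersample is hypothetical.

```python
import random

def random_undersample(labels, seed=1):
    """Return sorted row indices that keep every minority row and an
    equally sized random subset of the majority rows (binary labels 0/1)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept_majority = rng.sample(majority, len(minority))  # sampling without replacement
    return sorted(minority + kept_majority)

# Same class counts as the training split above: 1139 "Yes" and 5949 "No".
toy_labels = [1] * 1139 + [0] * 5949
kept = random_undersample(toy_labels)
kept_labels = [toy_labels[i] for i in kept]
print(len(kept), sum(kept_labels), len(kept_labels) - sum(kept_labels))
```

The price of the balance is that 4810 of the 5949 majority rows are discarded, so the model sees far fewer non-churners than are available.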
log_reg_under = LogisticRegression(random_state = 1)
log_reg_under.fit(X_train_un,y_train_un )
LogisticRegression(random_state=1)
##KFold Cross validation
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=log_reg_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.figure(figsize=(4, 4))
plt.boxplot(cv_result_under)
plt.show()
Cross-validated recall on the undersampled training data improves and ranges from 0.73 to 0.741.
#checking performance on test set.
get_metrics_score(log_reg_under,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_under,y_test)
Accuracy on training set : 0.7423178226514486
Accuracy on test set : 0.7581441263573544
Recall on training set : 0.7436347673397717
Recall on test set : 0.7090163934426229
Precision on training set : 0.7416812609457093
Precision on test set : 0.36847710330138445
F1 on training set : 0.7426567295046032
F1 on test set : 0.48493342676944645
Model performance on recall has improved: recall on the test set is 0.71, up from 0.33 for the baseline model, although test precision drops to 0.37. There is little sign of overfitting on recall, as the train (0.74) and test (0.71) values are close.
from imblearn.over_sampling import SMOTE
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train==0)))
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1) #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over==1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_over.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 1139 Before UpSampling, counts of label 'No': 5949 After UpSampling, counts of label 'Yes': 5949 After UpSampling, counts of label 'No': 5949 After UpSampling, the shape of train_X: (11898, 24) After UpSampling, the shape of train_y: (11898,)
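SMOTE balances the classes by creating synthetic minority rows rather than duplicating existing ones. The core step can be sketched as an interpolation between a minority sample and one of its minority-class neighbours. The code below is a simplified illustration with made-up points and a hypothetical helper name; it skips the nearest-neighbour search that the real algorithm performs.

```python
import random

def smote_like_point(x, neighbour, rng):
    """Create one synthetic sample on the segment between x and a neighbour."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbour)]

rng = random.Random(1)
minority_row = [2.0, 10.0]      # hypothetical minority-class sample
nearest_minority = [4.0, 14.0]  # hypothetical minority-class neighbour
synthetic = smote_like_point(minority_row, nearest_minority, rng)
print(synthetic)

# Number of synthetic rows needed to balance the training split above.
n_synthetic = 5949 - 1139
print(n_synthetic)
```

Because the interpolation treats every column as continuous, one-hot encoded columns can receive fractional values in the synthetic rows. imbalanced-learn provides SMOTENC for data that mixes continuous and categorical features.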
log_reg_over = LogisticRegression(random_state = 1)
# Training the basic logistic regression model with training set
log_reg_over.fit(X_train_over,y_train_over)
LogisticRegression(random_state=1)
# using kfold cross validation
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_over=cross_val_score(estimator=log_reg_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.figure(figsize=(4, 4))
plt.boxplot(cv_result_over)
plt.show()
#Calculating different metrics
get_metrics_score(log_reg_over,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_over,y_test)
Accuracy on training set : 0.7973609009917633
Accuracy on test set : 0.7815070746956235
Recall on training set : 0.7920658934274668
Recall on test set : 0.639344262295082
Precision on training set : 0.800543662928984
Precision on test set : 0.39
F1 on training set : 0.796282213772708
F1 on test set : 0.48447204968944096
# Choose the type of classifier.
lr_estimator = LogisticRegression(random_state=1,solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(lr_estimator, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set lr_estimator to the best combination of parameters
lr_estimator = grid_obj.best_estimator_
# Fit the best estimator to the data (GridSearchCV already refits it by default, so this step is optional)
lr_estimator.fit(X_train_over, y_train_over)
LogisticRegression(C=0.1, random_state=1, solver='saga')
#Calculating different metrics
get_metrics_score(lr_estimator,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(lr_estimator,y_test)
Accuracy on training set : 0.7029752899646999
Accuracy on test set : 0.7857847976307996
Recall on training set : 0.5629517565977475
Recall on test set : 0.5081967213114754
Precision on training set : 0.7819285547513425
Precision on test set : 0.37632776934749623
F1 on training set : 0.6546129788897577
F1 on test set : 0.43243243243243246
# training models by creating pipeline
from sklearn.ensemble import BaggingClassifier
models = []
models.append(
(
"LR",
Pipeline(
steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1)),
]
),
)
)
models.append(
(
"RF",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"GBM",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"ADB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
]
),
)
)
models.append(
(
"Bagging",
Pipeline(
steps=[
("scaler", StandardScaler()),
("bagging", BaggingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"DTREE",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
LR: 42.58250251178607
RF: 67.33827961975423
GBM: 74.97449571064224
ADB: 72.16941030991576
XGB: 81.29608161372595
Bagging: 74.80214854316408
DTREE: 73.66257052322436
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(10,50,10),'xgbclassifier__scale_pos_weight':[9,10,11],
'xgbclassifier__learning_rate':[0.01,0.1,0.2], 'xgbclassifier__gamma':[1,3,5],
'xgbclassifier__subsample':[0.8,0.9,1.0]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Best parameters are {'xgbclassifier__gamma': 3, 'xgbclassifier__learning_rate': 0.01, 'xgbclassifier__n_estimators': 10, 'xgbclassifier__scale_pos_weight': 11, 'xgbclassifier__subsample': 0.8} with CV score=0.947325913903702:
Wall time: 5min 8s
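Two quick pieces of arithmetic help put this search in context. The XGBoost documentation suggests the ratio of negative to positive instances as a typical value for scale_pos_weight, and the size of the grid explains the run time. The sketch below uses the class counts printed in the resampling cells; the variable names are only illustrative.

```python
# Class counts of the training split, as printed in the resampling cells.
n_negative, n_positive = 5949, 1139

# Commonly suggested starting point for XGBoost's scale_pos_weight.
ratio = n_negative / n_positive
print(round(ratio, 2))

# Size of the grid searched above: 4 x 3 x 3 x 3 x 3 combinations, 5 folds each.
grid_sizes = {"n_estimators": 4, "scale_pos_weight": 3, "learning_rate": 3,
              "gamma": 3, "subsample": 3}
n_candidates = 1
for size in grid_sizes.values():
    n_candidates *= size
print(n_candidates, n_candidates * 5)
```

The grid values of 9 to 11 are roughly twice this ratio, which pushes the model towards flagging churners. That is in line with the objective of maximising recall, but it is also likely to cost precision.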
# Creating new pipeline with best parameters
xgb_tunedGS = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=10,
scale_pos_weight=11,
subsample=0.8,
learning_rate=0.01,
gamma=3,
eval_metric='logloss',
),
)
# Fit the model on training data
xgb_tunedGS.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('xgbclassifier',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, eval_metric='logloss',
gamma=3, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=10,
n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=11,
subsample=0.8, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
# Note: the pipeline was fit on X_train, so the "training set" figures printed here are
# measured on the SMOTE-resampled X_train_over rather than on the data the model was fit on;
# the comparison table further down recomputes them on X_train.
get_metrics_score(xgb_tunedGS,X_train_over,X_test,y_train_over,y_test)
# Creating confusion matrix
make_confusion_matrix(xgb_tunedGS, y_test)
Accuracy on training set : 0.9190620272314675
Accuracy on test set : 0.8808818690358671
Recall on training set : 0.956631366616238
Recall on test set : 0.9221311475409836
Precision on training set : 0.8897748592870544
Precision on test set : 0.5813953488372093
F1 on training set : 0.9219927095990278
F1 on test set : 0.7131537242472267
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(),XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(10,100,20),
'xgbclassifier__scale_pos_weight':[0,1,2,5,10],
'xgbclassifier__learning_rate':[0.01,0.1,0.2,0.05],
'xgbclassifier__gamma':[0,1,3,5],
'xgbclassifier__subsample':[0.7,0.8,0.9,1],
'xgbclassifier__max_depth':np.arange(1,10,1),
'xgbclassifier__reg_lambda':[0,1,2,5,10]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'xgbclassifier__subsample': 0.8, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__reg_lambda': 10, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__max_depth': 5, 'xgbclassifier__learning_rate': 0.1, 'xgbclassifier__gamma': 5} with CV score=0.9499613571373366:
Wall time: 1min
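The randomized search is much faster than the grid search because it only evaluates n_iter=50 sampled combinations out of a far larger space. A short sketch of the arithmetic, counting the values listed in param_grid above:

```python
from math import prod

# Number of values per hyperparameter in the randomized-search space above.
space_sizes = {
    "n_estimators": 5,       # np.arange(10, 100, 20) -> 10, 30, 50, 70, 90
    "scale_pos_weight": 5,
    "learning_rate": 4,
    "gamma": 4,
    "subsample": 4,
    "max_depth": 9,          # np.arange(1, 10, 1)
    "reg_lambda": 5,
}
total = prod(space_sizes.values())
n_iter = 50
print(total, f"{n_iter / total:.4%}")
```

With well under 1% of the space explored, the reported best parameters are the best of the sampled candidates, not necessarily the best the space has to offer.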
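# Creating new pipeline with the tuned parameters
# Note: the search above reported learning_rate=0.1 and reg_lambda=10, whereas this pipeline
# uses learning_rate=0.01 and reg_lambda=1; the results below correspond to the values set here.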
xgb_tunedRS = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
gamma=5,
subsample=0.8,
learning_rate= 0.01,
eval_metric='logloss', max_depth = 5, reg_lambda = 1
),
),
]
)
# Fit the model on training data
xgb_tunedRS.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('XGB',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, eval_metric='logloss',
gamma=5, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=5,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=8, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.8, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
# Note: the pipeline was fit on X_train, so the "training set" figures printed here are
# measured on the SMOTE-resampled X_train_over rather than on the data the model was fit on.
get_metrics_score(xgb_tunedRS,X_train_over,X_test,y_train_over,y_test)
# Creating confusion matrix
make_confusion_matrix(xgb_tunedRS, y_test)
Accuracy on training set : 0.9085560598419903
Accuracy on test set : 0.859822309970385
Recall on training set : 0.9598251807026391
Recall on test set : 0.9364754098360656
Precision on training set : 0.8705595365147126
Precision on test set : 0.5363849765258216
F1 on training set : 0.9130156699712185
F1 on test set : 0.682089552238806
#checking shape to see the number of features
X_train.shape
(7088, 24)
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(), BaggingClassifier(random_state=1))
#Parameter grid to pass in GridSearchCV
param_grid={'baggingclassifier__n_estimators':np.arange(50,200,50),'baggingclassifier__max_features':[10,15,18,20,22],
'baggingclassifier__max_samples':[800,1000,1200]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Best parameters are {'baggingclassifier__max_features': 22, 'baggingclassifier__max_samples': 1200, 'baggingclassifier__n_estimators': 150} with CV score=0.7190586598655229:
Wall time: 1min 51s
# Creating new pipeline with best parameters and creating tuned bagging classifier
bg_tunedGS = make_pipeline(
StandardScaler(),
BaggingClassifier(
random_state=1,
n_estimators=150,
max_features=22,
max_samples=1200,
),
)
# Fit the model on training data
bg_tunedGS.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('baggingclassifier',
BaggingClassifier(max_features=22, max_samples=1200,
n_estimators=150, random_state=1))])
# Calculating different metrics
# Note: the pipeline was fit on X_train, so the "training set" figures printed here are
# measured on the SMOTE-resampled X_train_over rather than on the data the model was fit on.
get_metrics_score(bg_tunedGS,X_train_over,X_test,y_train_over,y_test)
# Creating confusion matrix
make_confusion_matrix(bg_tunedGS, y_test)
Accuracy on training set : 0.8775424441082534
Accuracy on test set : 0.941428101349128
Recall on training set : 0.7648344259539418
Recall on test set : 0.7069672131147541
Precision on training set : 0.9874131944444444
Precision on test set : 0.9078947368421053
F1 on training set : 0.8619873070000947
F1 on test set : 0.794930875576037
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(),BaggingClassifier(random_state=1, n_estimators = 50))
#Parameter grid to pass in GridSearchCV
param_grid={'baggingclassifier__n_estimators':np.arange(50,200,50),'baggingclassifier__max_features':[10,15,18,20,22],
'baggingclassifier__max_samples':[800,1000,1200]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'baggingclassifier__n_estimators': 150, 'baggingclassifier__max_samples': 1200, 'baggingclassifier__max_features': 22} with CV score=0.7190586598655229:
Wall time: 1min 55s
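The randomized search returns exactly the same parameters and CV score as the grid search for the bagging classifier. A likely reason is the size of the search space: when every entry in param_distributions is a list, scikit-learn samples without replacement, and if the space holds fewer candidates than n_iter it evaluates all of them (with a warning). The sketch below simply counts the candidates in the space used above.

```python
from itertools import product

bagging_space = {
    "n_estimators": [50, 100, 150],          # np.arange(50, 200, 50)
    "max_features": [10, 15, 18, 20, 22],
    "max_samples": [800, 1000, 1200],
}
candidates = list(product(*bagging_space.values()))
n_iter = 50
print(len(candidates), len(candidates) < n_iter)
```

With 45 candidates and n_iter=50, the randomized search here covers the full grid, so it gives no speed advantage and no different result. A smaller n_iter or a wider space would be needed for the two searches to differ.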
# Creating new pipeline with best parameters and creating tuned bagging classifier
bg_tunedRS = make_pipeline(
StandardScaler(),
BaggingClassifier(
random_state=1,
n_estimators=150,
max_features=22,
max_samples=1200,
),
)
# Fit the model on training data
bg_tunedRS.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('baggingclassifier',
BaggingClassifier(max_features=22, max_samples=1200,
n_estimators=150, random_state=1))])
# Calculating different metrics
# Note: the pipeline was fit on X_train, so the "training set" figures printed here are
# measured on the SMOTE-resampled X_train_over rather than on the data the model was fit on.
get_metrics_score(bg_tunedRS,X_train_over,X_test,y_train_over,y_test)
# Creating confusion matrix
make_confusion_matrix(bg_tunedRS, y_test)
Accuracy on training set : 0.8775424441082534
Accuracy on test set : 0.941428101349128
Recall on training set : 0.7648344259539418
Recall on test set : 0.7069672131147541
Precision on training set : 0.9874131944444444
Precision on test set : 0.9078947368421053
F1 on training set : 0.8619873070000947
F1 on test set : 0.794930875576037
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
#Parameter grid to pass in GridSearchCV
param_grid={'decisiontreeclassifier__max_depth':np.arange(1,10),'decisiontreeclassifier__criterion':['entropy','gini'],
'decisiontreeclassifier__splitter':['best','random'],
'decisiontreeclassifier__min_impurity_decrease': [0.000001,0.00001],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
#Fitting parameters in GridSeachCV
grid_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(grid_cv.best_params_,grid_cv.best_score_))
Best parameters are {'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__max_depth': 9, 'decisiontreeclassifier__min_impurity_decrease': 1e-06, 'decisiontreeclassifier__splitter': 'best'} with CV score=0.7523997217713888:
Wall time: 8.21 s
# Creating new pipeline with best parameters and creating tuned decision tree classifier
dt_tunedGS = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(
max_depth=9,
criterion='entropy',
splitter='best',
min_impurity_decrease=0.000001
),
)
# Fit the model on training data
dt_tunedGS.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(criterion='entropy', max_depth=9,
min_impurity_decrease=1e-06))])
# Calculating different metrics
# Note: the pipeline was fit on X_train, so the "training set" figures printed here are
# measured on the SMOTE-resampled X_train_over rather than on the data the model was fit on.
get_metrics_score(dt_tunedGS,X_train_over,X_test,y_train_over,y_test)
# Creating confusion matrix
make_confusion_matrix(dt_tunedGS, y_test)
Accuracy on training set : 0.9199865523617414
Accuracy on test set : 0.930898321816387
Recall on training set : 0.8522440746343923
Recall on test set : 0.7684426229508197
Precision on training set : 0.9858059498347268
Precision on test set : 0.7944915254237288
F1 on training set : 0.9141723764875586
F1 on test set : 0.7812499999999999
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(),DecisionTreeClassifier(random_state=1))
#Parameter grid to pass in GridSearchCV
param_grid={'decisiontreeclassifier__max_depth':np.arange(1,10),'decisiontreeclassifier__criterion':['entropy','gini'],
'decisiontreeclassifier__splitter':['best','random'],
'decisiontreeclassifier__min_impurity_decrease': [0.000001,0.00001],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'decisiontreeclassifier__splitter': 'best', 'decisiontreeclassifier__min_impurity_decrease': 1e-05, 'decisiontreeclassifier__max_depth': 9, 'decisiontreeclassifier__criterion': 'entropy'} with CV score=0.7523997217713888:
Wall time: 5.33 s
# Creating new pipeline with best parameters and creating tuned decision tree classifier
dt_tunedRS = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(
max_depth=9,
criterion='entropy',
splitter='best',
min_impurity_decrease=0.00001
),
)
# Fit the model on training data
dt_tunedRS.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('decisiontreeclassifier',
DecisionTreeClassifier(criterion='entropy', max_depth=9,
min_impurity_decrease=1e-05))])
# Calculating different metrics
# Note: the pipeline was fit on X_train, so the "training set" figures printed here are
# measured on the SMOTE-resampled X_train_over rather than on the data the model was fit on.
get_metrics_score(dt_tunedRS,X_train_over,X_test,y_train_over,y_test)
# Creating confusion matrix
make_confusion_matrix(dt_tunedRS, y_test)
Accuracy on training set : 0.9209110774920155
Accuracy on test set : 0.933201711089174
Recall on training set : 0.8544293158514036
Recall on test set : 0.7766393442622951
Precision on training set : 0.985459480418767
Precision on test set : 0.8012684989429175
F1 on training set : 0.9152786531016476
F1 on test set : 0.7887617065556712
# defining list of model
models = [lr,log_reg_over,lr_estimator,log_reg_under,xgb_tunedGS,xgb_tunedRS,bg_tunedGS,bg_tunedRS,dt_tunedGS,dt_tunedRS]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through all the models to get the metrics score - Accuracy, Recall, Precision and F1
for model in models:
    j = get_metrics_score(model,X_train,X_test,y_train,y_test,False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
    f1_train.append(j[6])
    f1_test.append(j[7])
comparison_frame = pd.DataFrame({'Model':['Logistic Regression','Logistic Regression on Oversampled data',
'Logistic Regression-Regularized','Logistic Regression on Undersampled data',
'XGBoost Grid Search','XGBoost Random Search','Bagging Classifier Grid Search',
'Bagging Classifier Random Search','Decision Tree Grid Search','Decision Tree Random Search'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1':f1_train,'Test_F1':f1_test
})
#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 |
|---|---|---|---|---|---|---|---|---|---|
| 5 | XGBoost Random Search | 0.875282 | 0.859822 | 0.969271 | 0.936475 | 0.565284 | 0.536385 | 0.714101 | 0.682090 |
| 4 | XGBoost Grid Search | 0.896021 | 0.880882 | 0.971905 | 0.922131 | 0.610927 | 0.581395 | 0.750254 | 0.713154 |
| 9 | Decision Tree Random Search | 0.974464 | 0.933202 | 0.906936 | 0.776639 | 0.932310 | 0.801268 | 0.919448 | 0.788762 |
| 8 | Decision Tree Grid Search | 0.974464 | 0.930898 | 0.905180 | 0.768443 | 0.933877 | 0.794492 | 0.919305 | 0.781250 |
| 3 | Logistic Regression on Undersampled data | 0.757195 | 0.758144 | 0.743635 | 0.709016 | 0.372144 | 0.368477 | 0.496047 | 0.484933 |
| 6 | Bagging Classifier Grid Search | 0.961061 | 0.941428 | 0.808604 | 0.706967 | 0.940756 | 0.907895 | 0.869688 | 0.794931 |
| 7 | Bagging Classifier Random Search | 0.961061 | 0.941428 | 0.808604 | 0.706967 | 0.940756 | 0.907895 | 0.869688 | 0.794931 |
| 1 | Logistic Regression on Oversampled data | 0.781179 | 0.781507 | 0.669008 | 0.639344 | 0.393595 | 0.390000 | 0.495610 | 0.484472 |
| 2 | Logistic Regression-Regularized | 0.795852 | 0.785785 | 0.549605 | 0.508197 | 0.401282 | 0.376328 | 0.463876 | 0.432432 |
| 0 | Logistic Regression | 0.879797 | 0.871997 | 0.366111 | 0.331967 | 0.762340 | 0.720000 | 0.494662 | 0.454418 |
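Sorting by test recall alone puts the two XGBoost models on top, but both have test precision below 0.6. One way to weigh recall more heavily than precision without ignoring precision altogether is the F-beta score with beta = 2. The sketch below applies it to the test precision and recall values copied from the table; the choice of beta = 2 is an assumption made for illustration, not something the analysis above prescribes.

```python
# (model, test precision, test recall) copied from the comparison table above.
table_rows = [
    ("XGBoost Random Search", 0.536385, 0.936475),
    ("XGBoost Grid Search", 0.581395, 0.922131),
    ("Decision Tree Random Search", 0.801268, 0.776639),
    ("Decision Tree Grid Search", 0.794492, 0.768443),
    ("Bagging Classifier Grid Search", 0.907895, 0.706967),
]

def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

ranked = sorted(table_rows, key=lambda row: f_beta(row[1], row[2]), reverse=True)
for name, precision, recall in ranked:
    print(f"{name}: F2 = {f_beta(precision, recall):.3f}")
```

Under this weighting the grid-search XGBoost model ranks first, slightly ahead of the random-search one, because its small loss in recall is offset by higher precision.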
feature_names = X_train.columns
importances = xgb_tunedRS[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()